A Bandit Learning Algorithm and Applications to Auction Design
We consider online bandit learning in which, at every time step, an algorithm has to make a decision and then observes only its reward. The goal is to design efficient (polynomial-time) algorithms that achieve a total reward close to that of the best fixed decision in hindsight. In this paper, we introduce a new notion of $(\lambda,\mu)$-concave functions and present a bandit learning algorithm whose performance guarantee is characterized as a function of the concavity parameters $\lambda$ and $\mu$. The algorithm is based on mirror descent, in which the update directions follow the gradient of the multilinear extensions of the reward functions. The regret bound induced by our algorithm is $\widetilde{O}(\sqrt{T})$, which is nearly optimal.
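To make the algorithmic template concrete, the following display sketches a generic bandit mirror-descent update of the kind described above; the gradient estimator, mirror map $\Phi$, and step size $\eta$ shown here are illustrative assumptions and not necessarily the paper's exact choices:
$$x_{t+1} \in \arg\max_{x \in \mathcal{K}} \left\{ \eta \, \langle \hat{g}_t, x \rangle - D_{\Phi}(x, x_t) \right\}, \qquad \mathbb{E}\big[\hat{g}_t \mid x_t\big] \approx \nabla F_t(x_t),$$
where $F_t$ is the multilinear extension of the reward function at step $t$, $\hat{g}_t$ is a gradient estimate built from the single observed reward, and $D_{\Phi}$ is the Bregman divergence induced by $\Phi$.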
Review for NeurIPS paper: A Bandit Learning Algorithm and Applications to Auction Design
Additional Feedback: The paper studies the online convex optimization problem, except that the functions that arrive online are not actually concave. They are "close to" concave, which the paper formalizes as (lambda, mu)-concavity. The idea is that, in many problems of interest where the input functions are not concave, the paper discretizes the function and considers the multilinear extension of the discretized function, which happens to be (lambda, mu)-concave for reasonable values of lambda and mu. The paper presents three applications to illustrate the value of its approach. The first of these is the analysis of adaptive dynamics on (lambda, mu)-smooth games, where previously high welfare was known to be guaranteed (i.e., the average welfare of playing the dynamics over time is at least lambda/mu of the optimal welfare) only for dynamics that had vanishing regret for each player.
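For reference, the multilinear extension mentioned here has a standard form: for a set function $f : 2^N \to \mathbb{R}$ on a ground set $N$, its multilinear extension over $[0,1]^N$ is
$$F(x) = \sum_{S \subseteq N} f(S) \prod_{i \in S} x_i \prod_{j \notin S} (1 - x_j),$$
i.e., the expected value of $f$ on a random set that includes each element $i$ independently with probability $x_i$. How exactly the paper discretizes the original decision space before taking this extension is not spelled out in the review.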